-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LST followups: better work divisions, concrete kernel dimension, some cleanup and fixes #47084
LST followups: better work divisions, concrete kernel dimension, some cleanup and fixes #47084
Conversation
cms-bot internal usage |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47084/43263 |
A new Pull Request was created by @ariostas for master. It involves the following packages:
@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
Tagging @fwyzard since most (if not all) of the comments addressed were his |
test parameters:
|
@cmsbuild please test |
-1 Failed Tests: UnitTests RelVals-GPU Unit TestsI found 1 errors in the following unit tests: ---> test test-das-selected-lumis had ERRORS RelVals-GPU
Comparison SummarySummary:
|
there are a bunch of errors like
the same workflow step3 in the baseline ran OK. So, the crash seems related to this PR. |
assign heterogeneous |
I can have a look in the coming days, but if this is urgent for any reason go ahead and sign it, and I will still have a look after the fact. |
hold
Actually, you know what ? |
Pull request has been put on hold by @fwyzard |
Vec3D const blocksPerGrid_crossCleanpT3{1, 4, 20}; | ||
WorkDiv3D const crossCleanpT3_workDiv = | ||
createWorkDiv(blocksPerGrid_crossCleanpT3, threadsPerBlock_crossCleanpT3, elementsPerThread); | ||
auto const crossCleanpT3_workDiv = cms::alpakatools::make_workdiv<Acc2D>({20, 4}, {64, 16}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the X and Y values in the ranges are inverted with respect to before - is it intended ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there were a few places where I flipped the order so that the loops are nested in the recommended order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
Vec3D const blocksPerGrid_crossCleanpLS{1, 4, 20}; | ||
WorkDiv3D const crossCleanpLS_workDiv = | ||
createWorkDiv(blocksPerGrid_crossCleanpLS, threadsPerBlock_crossCleanpLS, elementsPerThread); | ||
auto const crossCleanpLS_workDiv = cms::alpakatools::make_workdiv<Acc2D>({20, 4}, {32, 16}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here (OK, so it's probably intended).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
if (slope == | ||
kVerticalModuleSlope) // Designated for tilted module when the slope is infinity (module lying along y-axis) | ||
if (slope == kVerticalModuleSlope || | ||
edm::isNotFinite(slope)) // Designated for tilted module when the slope is infinity (module lying along y-axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@makortel do you know if edm::isFinite
/edm::isNotFinite
is guaranteed to in device code ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically they could presently work: the functions are constexpr
and presently use union
for type punning (that is strictly speaking undefined behavior). I'd like to replace the union
with std::bit_cast
that is constexpr
, but on the other hand e.g. https://stackoverflow.com/a/78232359 kind of suggests to use cuda::std::bit_cast
on CUDA 12.8.
I see in GCC 12 the std::bit_cast
implementation is just a call to __builtin_bit_cast
, and that e.g. in https://github.com/cms-sw/cmssw/blob/master/HeterogeneousCore/AlpakaInterface/interface/atomicMaxF.h we use edm::bit_cast
(that just forwards to std::bit_cast
or __builtin_bit_cast
) only for CPU implementation (I don't remember the exact reason for that though, whether the edm::bit_cast
didn't work on device code, or the intrinsics were "easier" on CUDA+HIP).
So perhaps for long term it would be better to define Alpaka-specific functions (ideally Alpaka could provide a portable bit_cast
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It currently does work, but I agree that would be better to be more careful about it (in a separate PR?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a related note, should I also reimplement std::distance
to be safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(in a separate PR?).
To me a separate PR would be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For pointers and random access iterators you can just use b - a
instead of std::distance(a, b)
.
Anyway, from looking at the code a while back, I think std::distance(a, b)
should be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I don't remember the exact reason for that though, whether the
edm::bit_cast
didn't work on device code, or the intrinsics were "easier" on CUDA+HIP).
I'm not sure either.
I've tried writing a simple kernel, and __builtin_bit_cast(float, i)
and __int_as_float(i)
compile to the same exact PTX code (basically a no-op).
unhold |
I see that we have it in Patatrack/CA code for half a year cmssw/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGeneratorKernelsImpl.h Lines 523 to 527 in dd230c0
#45542 (14_1_X) is it reasonable to rely on this case to let it go in this PR as well? |
Could |
we used |
OK, thanks, then I don't see reason to request further changes here. I'll see if we can fix |
+heterogeneous |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @mandrenguyen, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
This PR addresses some of the LST followups that we have listed in #46746.
Here is the list of fixes/changes:
cms::alpakatools::makeworkdiv
(instead of our customcreateWorkDiv
) and we now usecms::alpakatools::uniform_elements
for kernel loops.kVerticalModuleSlope
(previously namedlst_INF
). We're doing this in two steps instead of one since the data files also need to be updated. We ensure a smooth transition by first supporting both options and later removing the legacy one.c.c. @slava77 @VourMa